
    Audio-Visual Fusion: New Methods and Applications

    The perception that we have of the world is influenced by elements of a diverse nature. Indeed, humans tend to integrate information coming from different sensory modalities to better understand their environment. Following this observation, scientists have been trying to combine different research domains. In particular, in joint audio-visual signal processing the information recorded with one or more video cameras and one or more microphones is combined in order to extract more knowledge about a given scene than when analyzing each modality separately. In this thesis we attempt the fusion of audio and video modalities considering one video camera and one microphone. This is the most common configuration in electronic devices such as laptops and cellphones, and it does not require controlled environments such as previously prepared meeting rooms. Even though numerous approaches have been proposed in the last decade, the fusion of audio and video modalities is still an open problem. All the methods in this domain are based on an assumption of synchrony between related events in the audio and video channels, i.e. the appearance of a sound is approximately synchronous with the movement of the image structure that has generated it. However, most approaches do not exploit the spatio-temporal consistency that characterizes video signals and, as a result, they assess the synchrony between single pixels and the soundtrack. The results that they obtain are thus sensitive to noise, and the coherence between neighboring pixels is not ensured.

    This thesis presents two novel audio-visual fusion methods which follow completely different strategies to evaluate the synchrony between moving image structures and sounds. Each fusion method is successfully demonstrated on a different application in this domain. Our first audio-visual fusion approach focuses on the modeling of the audio and video signals. We propose to decompose each modality into a small set of functions representing the structures that are inherent in the signals. The audio signal is decomposed into a set of atoms representing concentrations of energy in the spectrogram (sounds), and the video signal is concisely represented by a set of image structures evolving through time, i.e. changing their location, size or orientation. As a result, meaningful features can be easily defined for each modality, such as the presence of a sound or the movement of a salient image structure. Finally, the fusion step simply evaluates the co-occurrence of these relevant events. This approach is applied to the blind detection and separation of the audio-visual sources that are present in a scene.

    In contrast, the second method that we propose uses basic features and is more focused on the fusion strategy that combines them. This approach is based on a nonlinear diffusion procedure that progressively erodes a video sequence and converts it into an audio-visual video sequence, where only the information that is required by applications in the joint audio-visual domain is kept. For this purpose we define a diffusion coefficient that depends on the synchrony between video motion and audio energy and preserves regions moving coherently with the presence of sounds. Thus, the regions that are least diffused are likely to be part of the video modality of the audio-visual source, and the application of this fusion method to the unsupervised extraction of audio-visual objects is straightforward.

    Unlike many methods in this domain which are specific to speakers, the fusion methods that we present in this thesis are completely general and can be applied to all kinds of audio-visual sources. Furthermore, our analysis is not limited to one source at a time, i.e. all applications can deal with multiple simultaneous sources. Finally, this thesis tackles the audio-visual fusion problem from a novel perspective, proposing creative fusion methods and techniques borrowed from other domains such as blind source separation, nonlinear diffusion based on partial differential equations (PDEs) and graph cut segmentation.
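
    As a rough illustration of the first strategy, the sketch below scores how often a structure's movements coincide with sounds. The binary event signals (obtained here by simple thresholding of audio energy and per-structure motion) and the small tolerance window are illustrative assumptions rather than the thesis' actual atom-based features; the snippet only shows how co-occurrence could be scored once such events are available.

        import numpy as np

        def event_signal(x, threshold):
            """Binary event signal: 1 where x exceeds the threshold."""
            return (np.asarray(x, dtype=float) > threshold).astype(float)

        def cooccurrence_score(audio_events, video_events, tolerance=2):
            """Fraction of video events falling within `tolerance` frames of an audio event."""
            audio_idx = np.flatnonzero(audio_events)
            video_idx = np.flatnonzero(video_events)
            if len(audio_idx) == 0 or len(video_idx) == 0:
                return 0.0
            nearest = np.abs(video_idx[:, None] - audio_idx[None, :]).min(axis=1)
            return float(np.mean(nearest <= tolerance))

        # Toy usage: one structure moves together with the sounds, one does not.
        rng = np.random.default_rng(0)
        audio_energy = np.zeros(100); audio_energy[[10, 40, 70]] = 1.0
        motion_synced = np.zeros(100); motion_synced[[11, 41, 69]] = 1.0
        motion_distractor = (rng.random(100) > 0.9).astype(float)
        for name, motion in [("synced", motion_synced), ("distractor", motion_distractor)]:
            score = cooccurrence_score(event_signal(audio_energy, 0.5), event_signal(motion, 0.5))
            print(name, round(score, 2))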

    Audio-Visual Object Extraction using Graph Cuts

    We propose a novel method to automatically extract the audio-visual objects that are present in a scene. First, the synchrony between related events in the audio and video channels is exploited to identify the possible locations of the sound sources. Video regions presenting a high coherence with the soundtrack are automatically labelled as being part of the audio-visual object. Next, a graph cut segmentation procedure is used to extract the entire object. The proposed segmentation approach includes a novel term that keeps together pixels in regions with high audio-visual synchrony. When longer sequences are analyzed, video signals are divided into groups of frames that are processed sequentially, propagating information about the source characteristics forward in time. Results show that our method is able to discriminate between audio-visual sources and distracting moving objects, and to adapt within a short delay when sources switch from active to inactive and vice versa.
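
    To make the role of the audio-visual term concrete, the sketch below writes down a plain binary labeling energy in which the smoothness weight between neighbouring pixels grows with their audio-visual synchrony, so that cutting through a synchronous region becomes expensive. The data term, the two weights and the synchrony map are illustrative assumptions, not the paper's exact formulation, and minimising such an energy would still require a max-flow/min-cut solver.

        import numpy as np

        def av_energy(labels, synchrony, lambda_smooth=0.5, lambda_av=2.0):
            """Energy of a binary labeling (1 = audio-visual object, 0 = background)."""
            labels = labels.astype(float)
            # Data term: high-synchrony pixels are cheap to label as object,
            # low-synchrony pixels are cheap to label as background.
            energy = np.sum(labels * (1.0 - synchrony) + (1.0 - labels) * synchrony)
            # Smoothness term over 4-connected neighbours: label discontinuities are
            # penalised more where both pixels are synchronous with the soundtrack,
            # which keeps such regions together.
            for axis in (0, 1):
                diff = np.abs(np.diff(labels, axis=axis))
                pair_sync = np.minimum(synchrony, np.roll(synchrony, -1, axis=axis))
                pair_sync = pair_sync[:-1, :] if axis == 0 else pair_sync[:, :-1]
                energy += np.sum(diff * (lambda_smooth + lambda_av * pair_sync))
            return energy

        # Toy usage: segmenting the synchronous patch costs less than ignoring it.
        synchrony = np.zeros((6, 6)); synchrony[1:5, 1:5] = 0.9
        segmentation = (synchrony > 0.5).astype(int)
        print(av_energy(segmentation, synchrony), av_energy(np.zeros_like(segmentation), synchrony))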

    Audio-driven Nonlinear Video Diffusion

    In this paper we present a novel nonlinear video diffusion approach based on the fusion of information in the audio and video channels. Both modalities are efficiently combined into a diffusion coefficient that integrates the basic assumption in this domain, i.e. that related events in audio and video channels occur approximately at the same time. The proposed diffusion coefficient thus depends on an estimate of the synchrony between sounds and video motion. As a result, information in video parts whose motion is not coherent with the soundtrack is reduced and the sound sources are automatically highlighted. Several tests on challenging real-world sequences containing significant auditory and/or visual distractors demonstrate that our approach is able to preserve the regions that are related to the soundtrack. In addition, we propose an application to the extraction of audio-related video regions by unsupervised segmentation in order to illustrate the capabilities of our method. To the best of our knowledge, this is the first nonlinear video diffusion approach that integrates information from the audio modality.
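
    A minimal sketch of how such a coefficient could be obtained, assuming per-pixel motion magnitudes and per-frame audio energy are already available: synchrony is measured as the positive temporal correlation between the two, and mapped to a coefficient that is small (little smoothing) where synchrony is high. The correlation measure and the exponential mapping are assumptions for illustration, not the paper's exact estimator.

        import numpy as np

        def synchrony_map(motion, audio_energy, eps=1e-8):
            """Per-pixel correlation between motion magnitude (T, H, W) and audio energy (T,)."""
            m = motion - motion.mean(axis=0)
            a = audio_energy - audio_energy.mean()
            num = np.tensordot(a, m, axes=(0, 0))                      # (H, W)
            den = np.sqrt((m ** 2).sum(axis=0) * (a ** 2).sum()) + eps
            return np.clip(num / den, 0.0, 1.0)                        # keep positive correlation only

        def diffusion_coefficient(sync, k=0.2):
            """Small where audio-visual synchrony is high, so those regions are preserved."""
            return np.exp(-sync / k)

        # Toy usage: only the region moving with the sound gets a small coefficient.
        rng = np.random.default_rng(1)
        audio = rng.random(50)
        motion = 0.1 * rng.random((50, 8, 8))
        motion[:, 2:4, 2:4] += audio[:, None, None]
        g = diffusion_coefficient(synchrony_map(motion, audio))
        print(g[3, 3] < g[0, 0])   # True: the synchronous region is barely diffused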

    Blind Audiovisual Source Separation Using Sparse Representations

    In this work we present a method to jointly separate the active audio and visual structures in a given mixture. Blind audiovisual source separation is achieved by exploiting the coherence between a video signal and a single-microphone audio track. The efficient representation of the audio and video sequences makes it possible to build relationships between correlated structures in both modalities. Video structures that exhibit strong correlations with the audio signal and that are spatially close are grouped using a robust clustering algorithm that can count and localize audiovisual sources. Using this information and exploiting the audio-video correlation, the audio sources are also localized and separated. To the best of our knowledge, this is the first blind audiovisual source separation algorithm conceived to deal with a video sequence and the corresponding mono audio signal.
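
    As a rough sketch of the grouping step, the snippet below scores each video structure by its temporal correlation with the audio track, keeps the strongly correlated ones and clusters them by spatial proximity to count and localize the sources. The correlation score, the thresholds and the single-linkage clustering are illustrative assumptions standing in for the paper's robust clustering algorithm.

        import numpy as np
        from scipy.cluster.hierarchy import fcluster, linkage

        def audio_correlation(video_activity, audio_activity, eps=1e-8):
            """Normalised correlation of each video structure's activity (N, T) with the audio (T,)."""
            v = video_activity - video_activity.mean(axis=1, keepdims=True)
            a = audio_activity - audio_activity.mean()
            return (v @ a) / (np.sqrt((v ** 2).sum(axis=1) * (a ** 2).sum()) + eps)

        def localize_sources(positions, video_activity, audio_activity,
                             corr_thresh=0.5, dist_thresh=20.0):
            """Return one estimated (x, y) centroid per detected audio-visual source."""
            corr = audio_correlation(video_activity, audio_activity)
            pts = positions[corr > corr_thresh]                 # keep audio-correlated structures
            if len(pts) == 0:
                return []
            if len(pts) == 1:
                return [tuple(map(float, pts[0]))]
            # group the remaining structures by spatial proximity; each cluster is one source
            labels = fcluster(linkage(pts, method="single"), dist_thresh, criterion="distance")
            return [tuple(map(float, pts[labels == c].mean(axis=0))) for c in np.unique(labels)]

        # Toy usage: two structures near (10, 10) move with the audio, one distractor does not.
        rng = np.random.default_rng(2)
        audio = rng.random(200)
        activity = np.vstack([audio + 0.1 * rng.random(200),
                              audio + 0.1 * rng.random(200),
                              rng.random(200)])
        positions = np.array([[9.0, 10.0], [11.0, 10.0], [80.0, 5.0]])
        print(localize_sources(positions, activity, audio))    # one source, centred near (10, 10)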

    AUDIO-BASED NONLINEAR VIDEO DIFFUSION

    We propose a novel nonlinear video diffusion approach which is able to focus on the parts of a video sequence that are relevant for applications in audio-visual analysis. The diffusion process is controlled by a diffusion coefficient based on an estimate of the synchrony between video motion and audio energy at each point of the video volume. Thus, regions whose motion is not coherent with the soundtrack are iteratively smoothed. The discretization of the proposed continuous diffusion formulation is carefully studied and its stability is demonstrated. Our approach is tested in challenging situations involving sequence degradation and distracting video motion. Results show that in all cases our method is able to keep the focus of attention on the sound sources. Index Terms — Audio-visual processing, linear/nonlinear diffusion, finite difference method.
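
    The explicit finite-difference update and its stability constraint can be sketched as follows, using a standard Perona-Malik-style discretisation with a per-pixel coefficient g in [0, 1]; for such an explicit scheme the time step must satisfy dt <= 0.25 on a 2-D grid. This is a generic scheme written from the abstract, not the paper's exact discretisation, and the coefficient g is simply given here rather than estimated from audio-visual synchrony.

        import numpy as np

        def diffusion_step(u, g, dt=0.2):
            """One explicit diffusion step of frame u with per-pixel coefficient g in [0, 1]."""
            assert dt <= 0.25, "explicit 2-D scheme is stable only for dt <= 0.25 when g <= 1"
            up = np.pad(u, 1, mode="edge")
            gp = np.pad(g, 1, mode="edge")
            h, w = u.shape
            flux = np.zeros_like(u, dtype=float)
            for dy, dx in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
                u_n = up[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]     # neighbouring pixel values
                g_n = gp[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
                flux += 0.5 * (g + g_n) * (u_n - u)                # coefficient averaged on the face
            return u + dt * flux

        # Toy usage: noise is smoothed everywhere except where g is small
        # (i.e. where motion would be synchronous with the soundtrack).
        rng = np.random.default_rng(3)
        frame = rng.random((32, 32))
        g = np.ones_like(frame); g[10:20, 10:20] = 0.05
        for _ in range(20):
            frame = diffusion_step(frame, g)
        print(round(frame[:8, :8].std(), 3), round(frame[12:18, 12:18].std(), 3))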

    UNSUPERVISED EXTRACTION OF AUDIO-VISUAL OBJECTS

    We propose a novel method to automatically detect and extract the video modality of the sound sources that are present in a scene. For this purpose, we first assess the synchrony between the moving objects captured with a video camera and the sounds recorded by a microphone. Next, video regions presenting a high coherence with the soundtrack are automatically labelled as being part of the source. This is the starting point for an innovative video segmentation approach whose objective is to extract the complete audio-visual object. The proposed graph-cut segmentation procedure includes an audio-visual term that links together pixels in regions with high audio-video coherence. Our approach is demonstrated on challenging sequences presenting non-stationary sound sources and distracting moving objects. Index Terms — audio-visual processing, graph cuts.
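
    As a rough sketch of the labelling step that precedes the graph cut, the snippet below thresholds a per-pixel synchrony map into "object" and "background" seeds and discards tiny isolated object seeds as noise; such seeds would then act as constraints for the segmentation. The thresholds, the minimum seed size and the synchrony map itself are illustrative assumptions, not the paper's values.

        import numpy as np
        from scipy import ndimage

        def av_seeds(synchrony, hi=0.7, lo=0.1, min_size=4):
            """Object/background seed masks derived from a per-pixel audio-visual synchrony map."""
            obj = synchrony >= hi                      # coherent with the soundtrack
            bg = synchrony <= lo                       # clearly unrelated to the soundtrack
            # drop tiny isolated object seeds, which are likely noise
            labels, n = ndimage.label(obj)
            for i in range(1, n + 1):
                if np.sum(labels == i) < min_size:
                    obj[labels == i] = False
            return obj, bg

        # Toy usage: a coherent patch is kept as an object seed, an isolated speckle is not.
        sync = np.zeros((20, 20)); sync[5:12, 6:14] = 0.9; sync[0, 0] = 0.95
        obj, bg = av_seeds(sync)
        print(obj.sum(), bg.sum())   # 56 object-seed pixels; the speckle is discarded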
